{
 "nbformat": 4,
 "nbformat_minor": 0,
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "name": "CTC_Segmentation_Tutorial.ipynb",
   "private_outputs": true,
   "provenance": [],
   "collapsed_sections": [],
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9"
  },
  "pycharm": {
   "stem_cell": {
    "cell_type": "raw",
    "source": [],
    "metadata": {
     "collapsed": false
    }
   }
  }
 },
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "BRANCH = 'v1.0.0'"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "x0DJqotopcyb"
   },
   "source": [
    "\"\"\"\n",
    "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n",
    "\n",
    "Instructions for setting up Colab are as follows:\n",
    "1. Open a new Python 3 notebook.\n",
    "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n",
    "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n",
    "4. Run this cell to set up dependencies.\n",
    "\"\"\"\n",
    "# If you're using Google Colab and not running locally, run this cell.\n",
    "# install NeMo\n",
    "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "CH7yR7cSwPKr"
   },
   "source": [
    "import json\n",
    "import os\n",
    "import wget\n",
    "\n",
    "from IPython.display import Audio\n",
    "import numpy as np\n",
    "import scipy.io.wavfile as wav\n",
    "\n",
    "! pip install pandas\n",
    "\n",
    "# optional\n",
    "! pip install plotly\n",
    "from plotly import graph_objects as go"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xXRARM8XtK_g"
   },
   "source": [
    "# Introduction\n",
    "End-to-end Automatic Speech Recognition (ASR) systems surpassed traditional systems in performance but require large amounts of labeled data for training. \n",
    "\n",
    "This tutorial will show how to use a pre-trained with Connectionist Temporal Classification (CTC) ASR model, such as [QuartzNet Model](https://arxiv.org/abs/1910.10261) to split long audio files and the corresponding transcripts into shorter fragments that are suitable for an ASR model training. \n",
    "\n",
    "We're going to use [ctc-segmentation](https://github.com/lumaku/ctc-segmentation) Python package based on the algorithm described in [CTC-Segmentation of Large Corpora for German End-to-end Speech Recognition](https://arxiv.org/pdf/2007.09127.pdf)."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "8FAZKakrIyGI"
   },
   "source": [
    "! pip install ctc_segmentation==1.1.0\n",
    "! pip install num2words\n",
    "! apt-get install -y ffmpeg"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "UD-OuFmEOX3T"
   },
   "source": [
    "# If you're running the notebook locally, update the TOOLS_DIR path below\n",
    "# In Colab, a few required scripts will be downloaded from NeMo github\n",
    "\n",
    "TOOLS_DIR = '<UPDATE_PATH_TO_NeMo_root>/tools/ctc_segmentation/scripts'\n",
    "\n",
    "if 'google.colab' in str(get_ipython()):\n",
    "    TOOLS_DIR = 'scripts/'\n",
    "    os.makedirs(TOOLS_DIR, exist_ok=True)\n",
    "\n",
    "    required_files = ['prepare_data.py',\n",
    "                    'normalization_helpers.py',\n",
    "                    'run_ctc_segmentation.py',\n",
    "                    'verify_segments.py',\n",
    "                    'cut_audio.py',\n",
    "                    'process_manifests.py',\n",
    "                    'utils.py']\n",
    "    for file in required_files:\n",
    "        if not os.path.exists(os.path.join(TOOLS_DIR, file)):\n",
    "            file_path = 'https://raw.githubusercontent.com/NVIDIA/NeMo/' + BRANCH + '/tools/ctc_segmentation/' + TOOLS_DIR + file\n",
    "            print(file_path)\n",
    "            wget.download(file_path, TOOLS_DIR)\n",
    "elif not os.path.exists(TOOLS_DIR):\n",
    "      raise ValueError(f'update path to NeMo root directory')"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "S1DZk-inQGTI"
   },
   "source": [
    "`TOOLS_DIR` should now contain scripts that we are going to need in the next steps, all necessary scripts could be found [here](https://github.com/NVIDIA/NeMo/tree/main/tools/ctc_segmentation/scripts)."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "1C9DdMfvRFM-"
   },
   "source": [
    "print(TOOLS_DIR)\n",
    "! ls -l $TOOLS_DIR"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "XUEncnqTIzF6"
   },
   "source": [
    "# Data Download\n",
    "First, let's download an audio file from [https://librivox.org/](https://librivox.org/)."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "bkeKX2I_tIgV"
   },
   "source": [
    "## create data directory and download an audio file\n",
    "WORK_DIR = 'WORK_DIR'\n",
    "DATA_DIR = WORK_DIR + '/DATA'\n",
    "os.makedirs(DATA_DIR, exist_ok=True)\n",
    "audio_file = 'childrensshortworks019_06acarriersdog_am_128kb.mp3'\n",
    "if not os.path.exists(os.path.join(DATA_DIR, audio_file)):\n",
    "    print('Downloading audio file')\n",
    "    wget.download('http://archive.org/download/childrens_short_works_vol_019_1310_librivox/' + audio_file, DATA_DIR)"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-_XE9MkKuAA7"
   },
   "source": [
    "Next, we need to get the corresponding transcript.\n",
    "\n",
    "Note, the text file and the audio file should have the same base name, for example, an audio file `example.wav` or `example.mp3` should have corresponding text data stored under `example.txt` file."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "3NSz3Qb7pzOe"
   },
   "source": [
    "# text source: http://www.gutenberg.org/cache/epub/24263/pg24263.txt\n",
    "text =  \"\"\"\n",
    "    A carrier on his way to a market town had occasion to stop at some houses\n",
    "    by the road side, in the way of his business, leaving his cart and horse\n",
    "    upon the public road, under the protection of a passenger and a trusty\n",
    "    dog. Upon his return he missed a led horse, belonging to a gentleman in\n",
    "    the neighbourhood, which he had tied to the end of the cart, and likewise\n",
    "    one of the female passengers. On inquiry he was informed that during his\n",
    "    absence the female, who had been anxious to try the mettle of the pony,\n",
    "    had mounted it, and that the animal had set off at full speed. The carrier\n",
    "    expressed much anxiety for the safety of the young woman, casting at the\n",
    "    same time an expressive look at his dog. Oscar observed his master's eye,\n",
    "    and aware of its meaning, instantly set off in pursuit of the pony, which\n",
    "    coming up with soon after, he made a sudden spring, seized the bridle, and\n",
    "    held the animal fast. Several people having observed the circumstance, and\n",
    "    the perilous situation of the girl, came to relieve her. Oscar, however,\n",
    "    notwithstanding their repeated endeavours, would not quit his hold, and\n",
    "    the pony was actually led into the stable with the dog, till such time as\n",
    "    the carrier should arrive. Upon the carrier entering the stable, Oscar\n",
    "    wagged his tail in token of satisfaction, and immediately relinquished the\n",
    "    bridle to his master.\n",
    "    \"\"\"\n",
    "\n",
    "with open(os.path.join(DATA_DIR, audio_file.replace('mp3', 'txt')), 'w') as f:\n",
    "    f.write(text)"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "yyUE_t4vw2et"
   },
   "source": [
    "The `DATA_DIR` should now contain both audio and text files:"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "VXrTzTyIpzE8"
   },
   "source": [
    "!ls -l $DATA_DIR"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "FWqlbSryw_WL"
   },
   "source": [
    "Listen to the audio:"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "1vC2DHawIGt8"
   },
   "source": [
    "Audio(os.path.join(DATA_DIR, audio_file))"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "RMT5lkPYzZHK"
   },
   "source": [
    "As one probably noticed, the audio file contains a prologue and an epilogue that are missing in the corresponding text. The segmentation algorithm could handle extra audio fragments at the end and the beginning of the audio, but prolonged untranscribed audio segments in the middle of the file could deteriorate segmentation results. That's why to improve the segmentation quality, it is recommended to normalize text, so that transcript contains spoken equivalents of abbreviations and numbers.\n",
    "\n",
    "# Prepare Text and Audio\n",
    "\n",
    "We're going to use `prepare_data.py` script to prepare both text and audio data for segmentation.\n",
    "\n",
    "Text preprocessing:\n",
    "* the text will be roughly split into sentences and stored under '$OUTPUT_DIR/processed/*.txt' where each sentence is going to start with a new line (we're going to find alignments for these sentences in the next steps)\n",
    "* to change the lenghts of the final sentences/fragments, use `min_length` and `max_length` arguments, that specify min/max number of chars of the text segment for alignment.\n",
    "* to specify additional punctuation marks to split the text into fragments, use `--additional_split_symbols` argument. If segments produced after splitting the original text based on the end of sentence punctuation marks is longer than `--max_length`, `--additional_split_symbols` are going to be used to shorten the segments. Use `|` as a separator between symbols, for example: `--additional_split_symbols=;|:|` \n",
    "* out-of-vocabulary words will be removed based on pre-trained ASR model vocabulary, (optionally) text will be changed to lowercase \n",
    "* sentences for alignment with the original punctuation and capitalization will be stored under  `$OUTPUT_DIR/processed/*_with_punct.txt`\n",
    "* numbers will be converted from written to their spoken form with `num2words` package. To use NeMo normalization tool use `--use_nemo_normalization` argument (not supported if running this segmentation tutorial in Colab, see [the text normalization tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tools/Text_Normalization_Tutorial.ipynb) for more details). Such normalization is usually enough for proper segmentation but to build a high-quality training dataset, all out-vocabulary symbols should be replaced with their actual spoken representations.\n",
    "\n",
    "Audio preprocessing:\n",
    "* `.mp3` files will be converted to `.wav` files\n",
    "* audio files will be resampled to use the same sampling rate as was used to pre-train the ASR model we're using for alignment\n",
    "* stereo tracks will be converted to mono\n",
    "* since librivox.org audio contains relatively long prologues, we're also cutting a few seconds from the beginning of the audio files (optional step, see `--cut_prefix` argument). In some cases, if an audio contains a very long untranscribed prologue, increasing `--cut_prefix` value might help improve segmentation quality.\n",
    "\n",
    "\n",
    "The `prepare_data.py` will preprocess all `.txt` files found in the `--in_text=$DATA_DIR` and all `.mp3` files located at `--audio_dir=$DATA_DIR`.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "u4zjeVVv-UXR"
   },
   "source": [
    "MODEL = 'QuartzNet15x5Base-En'\n",
    "OUTPUT_DIR = WORK_DIR + '/output'\n",
    "\n",
    "! python $TOOLS_DIR/prepare_data.py \\\n",
    "--in_text=$DATA_DIR \\\n",
    "--output_dir=$OUTPUT_DIR/processed/ \\\n",
    "--language='eng' \\\n",
    "--cut_prefix=3 \\\n",
    "--model=$MODEL \\\n",
    "--audio_dir=$DATA_DIR"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "kmDTCuTLH7pm"
   },
   "source": [
    "The following four files should be generated and stored at the `$OUTPUT_DIR/processed` folder:\n",
    "* childrensshortworks019_06acarriersdog_am_128kb.txt\n",
    "* childrensshortworks019_06acarriersdog_am_128kb.wav\n",
    "* childrensshortworks019_06acarriersdog_am_128kb_with_punct.txt\n",
    "* childrensshortworks019_06acarriersdog_am_128kb_with_punct_normalized.txt"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "6R7OKAsYH9p0"
   },
   "source": [
    "! ls -l $OUTPUT_DIR/processed"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bIvKBwRcH_9W"
   },
   "source": [
    "The `.txt` file without punctuation contains preprocessed text phrases that we're going to align within the audio file. Here, we split the text into sentences. Each line should contain a text snippet for alignment."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "74GLpMgoICmk"
   },
   "source": [
    "with open(os.path.join(OUTPUT_DIR, 'processed', audio_file.replace('.mp3', '.txt')), 'r') as f:\n",
    "    for line in f:\n",
    "        print (line)"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "QrvZAjeoR9U1"
   },
   "source": [
    "# Run CTC-Segmentation\n",
    "\n",
    "In this step, we're going to use the [`ctc-segmentation`](https://github.com/lumaku/ctc-segmentation) to find the start and end time stamps for the segments we created during the previous step.\n",
    "\n",
    "\n",
    "As described in the [CTC-Segmentation of Large Corpora for German End-to-end Speech Recognition](https://arxiv.org/pdf/2007.09127.pdf), the algorithm is relying on a CTC-based ASR model to extract utterance segments with exact time-wise alignments. For this tutorial, we're using a pre-trained 'QuartzNet15x5Base-En' model."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "xyKtaqAd-Tvk"
   },
   "source": [
    "WINDOW = 8000\n",
    "\n",
    "! python $TOOLS_DIR/run_ctc_segmentation.py \\\n",
    "--output_dir=$OUTPUT_DIR \\\n",
    "--data=$OUTPUT_DIR/processed \\\n",
    "--model=$MODEL \\\n",
    "--window_len=$WINDOW \\\n",
    "--no_parallel"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "wY27__e3HmhH"
   },
   "source": [
    "`WINDOW` parameter might need to be adjusted depending on the length of the utterance one wants to align, the default value should work in most cases.\n",
    "\n",
    "Let's take a look at the generated alignments.\n",
    "The expected output for our audio sample with 'QuartzNet15x5Base-En' model looks like this:\n",
    "\n",
    "```\n",
    "<PATH_TO>/processed/childrensshortworks019_06acarriersdog_am_128kb.wav\n",
    "16.03 32.39 -4.5911999284929115 | a carrier on ... a trusty dog. | ...\n",
    "33.31 45.01 -0.22886803973405373 | upon his ... passengers. | ...\n",
    "46.17 58.57 -0.3523662826061572 | on inquiry ... at full speed. | ...\n",
    "59.75 69.43 -0.04128918756038118 | the carrier ... dog. | ...\n",
    "69.93 85.31 -0.3595261826390344 | oscar observed ... animal fast. | ...\n",
    "85.95 93.43 -0.04447770533708611 | several people ... relieve her. | ...\n",
    "93.61 105.95 -0.07326174931639003 | oscar however ... arrive. | ...\n",
    "106.65 116.91 -0.14680841514778062 | upon the carrier ... his master. | ...\n",
    "```\n",
    "\n",
    "Details of the file content:\n",
    "- the first line of the file contains the path to the original audio file\n",
    "- all subsequent lines contain:\n",
    "  * the first number is the start of the segment (in seconds)\n",
    "  * the second one is the end of the segment (in seconds)\n",
    "  * the third value - alignment confidence score (in log space)\n",
    "  * text fragments corresponding to the timestamps\n",
    "  * original text without pre-processing\n",
    "  * normalized text"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "ktBAsfJRVCwI"
   },
   "source": [
    "alignment_file = str(WINDOW) + '_' + audio_file.replace('.mp3', '_segments.txt')\n",
    "! cat $OUTPUT_DIR/segments/$alignment_file"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xCwEFefHZz1C"
   },
   "source": [
    "Finally, we're going to split the original audio file into segments based on the found alignments. We're going to create three subsets and three corresponding manifests:\n",
    "* high scored clips (segments with the segmentation score above the threshold value, default threshold value = -5)\n",
    "* low scored clips (segments with the segmentation score below the threshold)\n",
    "* deleted segments (segments that were excluded during the alignment. For example, in our sample audio file, the prologue and epilogue that don't have the corresponding transcript were excluded. Oftentimes, deleted files also contain such things as clapping, music, or hard breathing. \n",
    "\n",
    "The alignment score values depend on the pre-trained model quality and the dataset, the `THRESHOLD` parameter might be worth adjusting based on the analysis of the low/high scored clips.\n",
    "\n",
    "Also note, that the `OFFSET` parameter is something one might want to experiment with since timestamps have a delay (offset) depending on the model.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "6YM64RPlitPL"
   },
   "source": [
    "OFFSET = 0\n",
    "THRESHOLD = -5\n",
    "\n",
    "! python $TOOLS_DIR/cut_audio.py \\\n",
    "--output_dir=$OUTPUT_DIR \\\n",
    "--model=$MODEL \\\n",
    "--alignment=$OUTPUT_DIR/segments/ \\\n",
    "--threshold=$THRESHOLD \\\n",
    "--offset=$OFFSET"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "QoyS0T8AZxcx"
   },
   "source": [
    "`manifests` folder should be created under `OUTPUT_DIR`, and it should contain\n",
    "corresponding manifests for the three groups of clips described above:"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "1UaSIflBZwaV"
   },
   "source": [
    "! ls -l $OUTPUT_DIR/manifests"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "F-nPT8z_IVD-"
   },
   "source": [
    "def plot_signal(signal, sample_rate):\n",
    "    \"\"\" Plot the signal in time domain \"\"\"\n",
    "    fig_signal = go.Figure(\n",
    "        go.Scatter(x=np.arange(signal.shape[0])/sample_rate,\n",
    "                   y=signal, line={'color': 'green'},\n",
    "                   name='Waveform',\n",
    "                   hovertemplate='Time: %{x:.2f} s<br>Amplitude: %{y:.2f}<br><extra></extra>'),\n",
    "        layout={\n",
    "            'height': 200,\n",
    "            'xaxis': {'title': 'Time, s'},\n",
    "            'yaxis': {'title': 'Amplitude'},\n",
    "            'title': 'Audio Signal',\n",
    "            'margin': dict(l=0, r=0, t=40, b=0, pad=0),\n",
    "        }\n",
    "    )\n",
    "    fig_signal.show()\n",
    "    \n",
    "def display_samples(manifest):\n",
    "    \"\"\" Display audio and reference text.\"\"\"\n",
    "    with open(manifest, 'r') as f:\n",
    "        for line in f:\n",
    "            sample = json.loads(line)\n",
    "            sample_rate, signal = wav.read(sample['audio_filepath'])\n",
    "            plot_signal(signal, sample_rate)\n",
    "            display(Audio(sample['audio_filepath']))\n",
    "            display('Reference text:       ' + sample['text_no_preprocessing'])\n",
    "            display('ASR transcript: ' + sample['transcript'])\n",
    "            print('\\n' + '-' * 110)"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "S69UFA30ZvxV"
   },
   "source": [
    "Let's examine the high scored segments we obtained.\n",
    "\n",
    "The `Reference text` in the next cell represents the original text without pre-processing, while `ASR transcript` is an ASR model prediction with greedy decoding. Also notice, that `ASR transcript` in some cases contains errors that could decrease the alignment score, but usually it doesn’t hurt the quality of the aligned segments."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "Q45uBtsHIaAD"
   },
   "source": [
    "high_score_manifest = str(WINDOW) + '_' + audio_file.replace('.mp3', '_high_score_manifest.json')\n",
    "display_samples(os.path.join(OUTPUT_DIR, 'manifests', high_score_manifest))"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "! cat $OUTPUT_DIR/manifests/$high_score_manifest"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "yivXpD25T4Ir"
   },
   "source": [
    "# Multiple files alignment\n",
    "\n",
    "Up until now, we were processing only one file at a time, but to create a large dataset processing of multiple files simultaneously could help speed up things considerably. \n",
    "\n",
    "Let's download another audio file and corresponding text."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "KRc9yMjPXPgj"
   },
   "source": [
    "# https://librivox.org/frost-to-night-by-edith-m-thomas/\n",
    "audio_file_2 = 'frosttonight_thomas_bk_128kb.mp3'\n",
    "if not os.path.exists(os.path.join(DATA_DIR, audio_file_2)):\n",
    "    print('Downloading audio file')\n",
    "    wget.download('http://www.archive.org/download/frost_to-night_1710.poem_librivox/frosttonight_thomas_bk_128kb.mp3', DATA_DIR)\n",
    "\n",
    "\n",
    "# text source: text source: https://www.bartleby.com/267/151.html\n",
    "text =  \"\"\"\n",
    "    APPLE-GREEN west and an orange bar,\t\n",
    "    And the crystal eye of a lone, one star …\t\n",
    "    And, “Child, take the shears and cut what you will,\t\n",
    "    Frost to-night—so clear and dead-still.”\t\n",
    "    \n",
    "    Then, I sally forth, half sad, half proud,\t        \n",
    "    And I come to the velvet, imperial crowd,\t\n",
    "    The wine-red, the gold, the crimson, the pied,—\t\n",
    "    The dahlias that reign by the garden-side.\t\n",
    "    \n",
    "    The dahlias I might not touch till to-night!\t\n",
    "    A gleam of the shears in the fading light,\t        \n",
    "    And I gathered them all,—the splendid throng,\t\n",
    "    And in one great sheaf I bore them along.\n",
    "    .    .    .    .    .    .\n",
    "    \n",
    "    In my garden of Life with its all-late flowers\t\n",
    "    I heed a Voice in the shrinking hours:\t\n",
    "    “Frost to-night—so clear and dead-still” …\t        \n",
    "    Half sad, half proud, my arms I fill.\t\n",
    "    \"\"\"\n",
    "\n",
    "with open(os.path.join(DATA_DIR, audio_file_2.replace('mp3', 'txt')), 'w') as f:\n",
    "  f.write(text)"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "YhLj6hZaFP_S"
   },
   "source": [
    "`DATA_DIR` should now contain two .mp3 files and two .txt files:"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "wpwWgZ5InuQX"
   },
   "source": [
    "! ls -l $DATA_DIR"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "hlxG3bOSnHZR"
   },
   "source": [
    "Audio(os.path.join(DATA_DIR, audio_file_2))"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "3ftilXu-5tzT"
   },
   "source": [
    "Finally, we need to download a script to perform all the above steps starting from the text and audio preprocessing to segmentation and manifest creation in a single step."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "KSwsrkbru1s5"
   },
   "source": [
    "if 'google.colab' in str(get_ipython()) and not os.path.exists('run_sample.sh'):\n",
    "    wget.download('https://raw.githubusercontent.com/NVIDIA/NeMo/' + BRANCH + '/tools/ctc_segmentation/run_sample.sh', '.')"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`run_sample.sh` script takes `DATA_DIR` argument and assumes that it contains folders `text` and `audio`.\n",
    "An example of the `DATA_DIR` folder structure:\n",
    "\n",
    "\n",
    "--DATA_DIR\n",
    "\n",
    "     |----audio\n",
    "            |---1.mp3\n",
    "            |---2.mp3\n",
    "            \n",
    "     |-----text\n",
    "            |---1.txt\n",
    "            |---2.txt\n",
    "            \n",
    "Let's move our files to subfolders to follow the above structure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! mkdir $DATA_DIR/text && mkdir $DATA_DIR/audio\n",
    "! mv $DATA_DIR/*txt $DATA_DIR/text/. && mv $DATA_DIR/*mp3 $DATA_DIR/audio/.\n",
    "! ls -l $DATA_DIR"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "nYXNvBDsHMEu"
   },
   "source": [
    "Next, we're going to execute `run_sample.sh` script to find alignment for two audio/text samples. By default, if the alignment is not found for an initial WINDOW size, the initial window size will be doubled a few times to re-attempt alignment. \n",
    "\n",
    "`run_sample.sh` applies two initial WINDOW sizes, 8000 and 12000, and then adds segments that were similarly aligned with two window sizes to `verified_segments` folder. This could be useful to reduce the amount of manual work while checking the alignment quality."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "hRFAl0gO92bp"
   },
   "source": [
    "if 'google.colab' in str(get_ipython()):\n",
    "    OUTPUT_DIR_2 = f'/content/{WORK_DIR}/output_multiple_files'\n",
    "else:\n",
    "    OUTPUT_DIR_2 = os.path.join(WORK_DIR, 'output_multiple_files')\n",
    "\n",
    "! bash $TOOLS_DIR/../run_sample.sh \\\n",
    "--MODEL_NAME_OR_PATH=$MODEL \\\n",
    "--DATA_DIR=$DATA_DIR \\\n",
    "--OUTPUT_DIR=$OUTPUT_DIR_2 \\\n",
    "--SCRIPTS_DIR=$TOOLS_DIR \\\n",
    "--CUT_PREFIX=3 \\\n",
    "--MIN_SCORE=$THRESHOLD  \\\n",
    "--USE_NEMO_NORMALIZATION=False"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "zzJTwKq2Kl9U"
   },
   "source": [
    "High scored manifests for the data samples were aggregated to the `all_manifest.json` under `OUTPUT_DIR_2`."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "id": "nacE_iQ2_85L"
   },
   "source": [
    "display_samples(os.path.join(OUTPUT_DIR_2, 'all_manifest.json'))"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "lcvT3P2lQ_GS"
   },
   "source": [
    "# Next Steps\n",
    "\n",
    "Check out [NeMo Speech Data Explorer tool](https://github.com/NVIDIA/NeMo/tree/main/tools/speech_data_explorer#speech-data-explorer) to interactively evaluate the aligned segments."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GYylwvTX2VSF"
   },
   "source": [
    "# References\n",
    "Kürzinger, Ludwig, et al. [\"CTC-Segmentation of Large Corpora for German End-to-End Speech Recognition.\"](https://arxiv.org/abs/2007.09127) International Conference on Speech and Computer. Springer, Cham, 2020."
   ]
  }
 ]
}